On the structure of genealogical trees in the presence of selection


P. R. A. Campos, M. T. Sonoda, J. F. Fontanari









Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369,
13560-970 São Carlos, SP, Brazil



Abstract 



We investigate through numerical simulations the effect of selection on two
summary statistics for nucleotide variation in a sample of two genes from a
population of $N$ asexually reproducing haploid individuals. One is the mean
time since two individuals had their most recent common ancestor
($\bar{T}_s$), and the other is the mean number of nucleotide differences
between two genes in the sample ($\bar{d}_s$). In the case of diminishing
epistasis, in which the deleterious effect of a new mutation is attenuated, we
find that the scaling of $\bar{d}_s$ with the population size depends on the
mutation rate, leading then to the onset of a sharp threshold phenomenon as $N$
becomes large.
1 Introduction



The rapid accumulation of DNA sequence data in the last two decades has
resulted in a shift of emphasis in population genetics from a prospective
approach, which focuses on the changes in the population composition with
time, to a retrospective approach, which explores the patterns of similarities
between the different sequences to obtain information about the evolutionary
history of those sequences. Although neutral genealogical processes have been
the subject of much attention in those decades, culminating with Kingman's
Coalescent Theory [1] (see [2,3] for reviews), very little is known about
genealogical processes with selection (see [4-6] for a few exceptions). The
purpose of this paper is to study numerically the effect of selection on two
widely used summary statistics for nucleotide variation in a random sample of
two genes, namely, the mean time since their most recent common ancestor,
$\bar{T}_s$, and the average distance (measured by the number of different
nucleotides in homologous sites) between them, $\bar{d}_s$. This last quantity
is easily measured from DNA sequence data and, within the neutral evolution
assumption, can be used to estimate the product of the effective population
size $N_e$ and the mutation rate per gene per generation $U$ [2]. In this paper
we show that this estimation procedure also holds in the case of diminishing
epistasis, provided that $U$ is not too small.



2 Model 



We consider a haploid population of $N$ individuals or genes that evolves
according to the discrete-time Wright-Fisher model with selection and mutation
[7]. Here, haploid means that each individual has only one copy of each
chromosome as, for instance, the mitochondrial genes, which are inherited
maternally. In this sense, we will use the terms gene and individual
interchangeably to refer to the unit of selection. An individual is represented
by an infinite sequence of sites, each labeled 0 or 1: the bit 0 denotes the
original (ancestral) type, and the bit 1 a mutant type. The fitness of an
individual with $k$ mutations is $w_k = (1-s)^{k^\alpha}$, where $s \in (0,1)$
is the selective advantage per site of the original nucleotide type, and
$\alpha > 0$ is the epistasis parameter. The case $\alpha = 1$ corresponds to
absence of epistasis, i.e., each new mutation reduces the fitness of the
individual by the same factor, irrespective of the number of previous
mutations. The case $\alpha > 1$ (synergistic epistasis) models the situation
where the disadvantageous effect of a new mutation increases with the number of
mutations already present, while the case $\alpha < 1$ (diminishing epistasis)
corresponds to the situation where the deleterious effect of a new mutation is
attenuated. The mutation mechanism is such that each new mutation in a mutant
offspring occurs at a new site that has never before seen a mutation. In
particular, we assume that the probability that $k$ new mutations occur in one
individual is given by the Poisson distribution



$$M_k = e^{-U} \frac{U^k}{k!} \qquad (1)$$
where U is the mean number of new mutations per individual per generation. 
The relevant quantities that characterize the population are measured after 
the procedures of selection and mutation, in that order. The model presented 
above is the celebrated infinite-sites model [8,9] which has been widely used 
by population geneticists to describe the DNA variability observed in samples 
of genes in the case of neutral mutations (s = 0). 
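
As a minimal illustration of the model ingredients, the following Python sketch
evaluates the fitness and the mutation probabilities. It assumes that the
fitness reads $w_k = (1-s)^{k^\alpha}$ and that $M_k$ is the Poisson
distribution of mean $U$; the function names `fitness` and `poisson_pmf` are
ours and purely illustrative.

```python
import math


def fitness(k, s, alpha):
    """Assumed fitness w_k = (1 - s)**(k**alpha) of an individual with k mutations."""
    if k == 0:
        return 1.0  # mutation-free individuals have maximal fitness
    return (1.0 - s) ** (k ** alpha)


def poisson_pmf(k, U):
    """Probability M_k of k new mutations when the mean number is U."""
    return math.exp(-U) * U ** k / math.factorial(k)


# Ratio w_k / w_{k-1}: the fitness factor paid for the k-th mutation,
# printed for diminishing, absent and synergistic epistasis at s = 0.5.
for alpha in (0.5, 1.0, 2.0):
    ratios = [fitness(k, 0.5, alpha) / fitness(k - 1, 0.5, alpha)
              for k in range(1, 5)]
    print(alpha, [round(r, 3) for r in ratios])
```

For $\alpha = 1$ the ratio $w_k / w_{k-1}$ equals $1 - s$ for every $k$, whereas
for $\alpha < 1$ it approaches 1 as $k$ grows, which is the attenuation that
characterizes diminishing epistasis.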



3 Analytical results 



In the neutral limit ($s = 0$) as well as in the strong-selection limit
($s = 1$) we can easily calculate $\bar{T}_s$ and $\bar{d}_s$ analytically
[5,10]. Clearly, the value of the epistasis parameter is irrelevant in those
limits.

We consider first the neutral limit. Let $T_{\alpha\beta}$ be the time in
generations since the latest common ancestor of individuals $\alpha$ and
$\beta$. In the following we will calculate the probability $P_0(T)$ that two
randomly chosen individuals have $T_{\alpha\beta} = T$. The notation
$\overline{(\ldots)}$ stands for an average over independent populations. The
probability that two individuals have no common ancestor in the preceding
generation is simply $(1 - 1/N)$. So for $N$ large the probability that their
ancestors, $t$ generations ago, are all different is $e^{-t/N}$. Hence the
probability that the latest common ancestor of the two individuals lived
exactly $T$ generations ago is given by

$$P_0(T) = \frac{1}{N} e^{-T/N} \qquad (2)$$

from where we obtain $\bar{T}_0 = N$. The relevant time scale in the neutral
case is thus proportional to $N$. The probability distribution $P_s(T)$, which
determines the statistical properties of the genealogies, depends on such
factors as the population size, geographic structure and the distribution of
fitness among the individuals. It should be stressed that neutral mutations
(i.e., mutations that do not affect the fitness of the individuals) have no
effect on the genealogies of random samples [2] and so Eqn. (2) holds true
irrespective of the value of the mutation rate $U$. Of course, the distance
(the number of different nucleotides) between two sampled sequences depends
strongly on the mutation process. For instance, for $U = 0$ all sequences in
the sample are identical. In the sequel we will calculate analytically the
distribution of distances between two sampled sequences.
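
The neutral result for the mean coalescence time can be checked with a short
Monte Carlo experiment. The sketch below assumes the discrete-time neutral
Wright-Fisher model, in which each of two lineages picks its parent uniformly
at random among the $N$ individuals of the preceding generation; the function
names, the number of samples and the seed are arbitrary choices of ours.

```python
import random


def pair_coalescence_time(N, rng):
    """Generations back until two lineages pick the same parent."""
    t = 1
    # In each generation the two lineages choose parents independently,
    # so they coalesce with probability 1/N.
    while rng.randrange(N) != rng.randrange(N):
        t += 1
    return t


def mean_coalescence_time(N, samples=10000, seed=1):
    """Sample mean of the pairwise coalescence time over independent pairs."""
    rng = random.Random(seed)
    total = sum(pair_coalescence_time(N, rng) for _ in range(samples))
    return total / samples


print(mean_coalescence_time(50))
```

The coalescence time generated this way is geometrically distributed with
success probability $1/N$, so its sample mean should fluctuate around $N$, and
for large $N$ its distribution approaches the exponential form used in the
text.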

We assume that during each time interval $dt$ (see footnote 1) each sequence
has a probability $U\,dt$ of mutating to a new one that has never before been
present in the population. Thus, in the neutral case the probability
$\phi_n^0(t)$ that a sequence differs from its ancestor at $n$ sites after the
divergence time $t$ obeys the equation

$$\frac{d\phi_n^0}{dt} = U \left[ \phi_{n-1}^0(t) - \phi_n^0(t) \right] \qquad (3)$$



Footnote 1: This continuous-time formulation yields the same results as the
discrete-time model presented before in the limit of large $N$, since in that
case the relevant time scale is of order $N$ generations, and so it is much
larger than the time unit, i.e., one generation.



whose solution is the Poisson distribution 



$$\phi_n^0(t) = e^{-Ut} \frac{(Ut)^n}{n!}. \qquad (4)$$



Hence the average distance between an individual and its ancestor increases
linearly with time, $\bar{n}_0 = Ut$. In a more general context the steady
accumulation of unfavorable mutations in an asexual population is referred to
as Muller's ratchet [5].

Let $W_n^0$ be the probability that the distance between two sampled
individuals is equal to $n$. Since in an asexual population the individuals are
all descended from a common ancestor at some point in the past we can write

$$W_n^0 = \int_0^\infty dT \, P_0(T) \, \phi_n^0(2T). \qquad (5)$$



Using Eqns. (2) and (4), the integral can easily be evaluated yielding 



$$W_n^0 = \frac{\lambda}{(\lambda + 1)^{n+1}} \qquad (6)$$

where $\lambda = 1/(2UN)$. Hence, $\bar{d}_0 = 1/\lambda = 2UN$.
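
The mean distance $2UN$ can be verified numerically. The sketch below takes
Eqn. (6) to read $W_n^0 = \lambda / (\lambda + 1)^{n+1}$ with
$\lambda = 1/(2UN)$, which is our reading of the formula, and sums the
distribution up to a cutoff `n_max` that is an arbitrary choice of ours.

```python
def distance_pmf_neutral(n, U, N):
    """Assumed form of Eqn. (6): geometric distribution of the distance n."""
    lam = 1.0 / (2.0 * U * N)
    p = lam / (lam + 1.0)   # weight of the n = 0 term
    q = 1.0 / (lam + 1.0)   # ratio between successive terms
    return p * q ** n


def truncated_moments(U, N, n_max=2000):
    """Normalization and mean of the distribution, summed over n < n_max.

    The cutoff must be much larger than 2*U*N for the truncation error
    to be negligible.
    """
    total = sum(distance_pmf_neutral(n, U, N) for n in range(n_max))
    mean = sum(n * distance_pmf_neutral(n, U, N) for n in range(n_max))
    return total, mean


total, mean = truncated_moments(U=0.1, N=50)
print(total, mean, 2 * 0.1 * 50)
```

Writing the distribution as $p\,q^n$ with $q = 1/(\lambda + 1)$ makes the
geometric structure explicit: the mean of such a distribution is $q/p =
1/\lambda$, in agreement with the value quoted above.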

We turn now to the analysis of the strong-selection limit ($s = 1$). Since only
individuals with $k = 0$ mutations can generate offspring there is no
difference in the fitness of the breeding individuals. This limit is similar to
the neutral limit in the sense that the probability of two individuals having
the same parent is $1/N_0$, where $N_0$ is the number of individuals with
$k = 0$ mutations. Clearly, at any generation $N_0$ is a random variable
distributed according to the binomial distribution

$$B(N_0) = \binom{N}{N_0} e^{-U N_0} \left( 1 - e^{-U} \right)^{N - N_0}. \qquad (7)$$

If $U$ is of order 1 and $N$ is large we have $N_0 \approx \bar{N}_0 = N e^{-U}$
so that $\bar{T}_1 = \bar{N}_0$. The probability distribution of the distance
$d_1$ between two sampled individuals is equally easy to obtain: since each of
the two individuals carries only the mutations acquired in the last generation,
$d_1$ is the sum of two independent Poisson variables of mean $U$ and so it
follows the Poisson distribution of Eqn. (1) with $U$ replaced by $2U$, so that
$\bar{d}_1 = 2U$. We must note that since the probability of extinction in one
generation is $B(0)$, the population will ultimately become extinct in a time
of order $1/B(0)$. However, if $U$ is not of order $N$ the extinction times
become so large that these events cannot be observed in the simulations
described in the sequel.
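
To get a feeling for the orders of magnitude involved, the following sketch
evaluates the mean number of mutation-free individuals and the extinction time
scale $1/B(0)$. It assumes that each offspring remains mutation-free with
probability $e^{-U}$, so that $B(0) = (1 - e^{-U})^N$; the function names and
the value $U = 1$ are illustrative choices of ours.

```python
import math


def mean_mutation_free(N, U):
    """Mean number of individuals with k = 0 mutations, N * exp(-U)."""
    return N * math.exp(-U)


def extinction_probability(N, U):
    """Assumed B(0): probability that none of the N offspring is mutation-free."""
    return (1.0 - math.exp(-U)) ** N


def extinction_time_scale(N, U):
    """Typical number of generations before extinction, 1 / B(0)."""
    return 1.0 / extinction_probability(N, U)


for N in (30, 50, 100, 300):
    print(N, round(mean_mutation_free(N, 1.0), 1),
          f"{extinction_time_scale(N, 1.0):.3g}")
```

For these parameter values the extinction time scale grows exponentially with
$N$ and far exceeds the number of generations followed in the simulations.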
Fig. 1. (a) Mean time since the most recent common ancestor of two randomly
chosen individuals as a function of the mutation rate $U$ for
$\alpha = s = 0.5$. (b) Average distance between two sampled individuals as a
function of $U$. The convention is $N = 30$ (○), 50 (△), 100 (×), and 300 (▽).
The dashed curve is $\exp(-U/s)$ while the solid lines are guides to the eye.

4 Simulations 



At any given time we keep track of the number of mutations $k_i$ on each
individual $i = 1, \ldots, N$, as well as of the identity of their parents.
This information allows us to obtain the most recent common ancestor of any two
individuals and also the distance between them. To create a new generation from
a given one we assume, as usual, that the number of offspring that each
individual contributes to the new generation is proportional to its relative
fitness, $w_i / w$, where $w_i$ is the fitness of individual $i$ and
$w = \sum_i w_i$ is the total fitness of the population. The offspring has all
the mutations of its parent plus a random number $k$ of new mutations
distributed according to the Poisson distribution $M_k$ given by Eqn. (1). The
initial population is composed of $N$ individuals without mutations whose
evolution we follow through typically 1000 generations. We then go backward in
time to determine the common ancestors of each pair of individuals. We
typically average our results over 300 independent runs.
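
The procedure just described can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the code used to produce the
results: it assumes the fitness $w_k = (1-s)^{k^\alpha}$, draws parents with
probability proportional to fitness, and uses the fact that in the
infinite-sites model the distance between two individuals is the number of
mutations each has accumulated since their common ancestor. All function names
and default parameter values are ours, and pairs that have not coalesced within
the simulated history are simply discarded.

```python
import math
import random


def fitness(k, s, alpha):
    """Assumed fitness w_k = (1 - s)**(k**alpha)."""
    return (1.0 - s) ** (k ** alpha)


def poisson_sample(U, rng):
    """Poisson variate of mean U by Knuth's product method (fine for small U)."""
    limit, k, p = math.exp(-U), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(N, U, s, alpha, generations, rng):
    """Return per-generation parent indices and mutation counts."""
    counts = [[0] * N]  # generation 0: N mutation-free individuals
    parents = []        # parents[g][i]: parent in generation g of i in g + 1
    for _ in range(generations):
        prev = counts[-1]
        weights = [fitness(k, s, alpha) for k in prev]
        # Selection: parents are chosen proportionally to fitness.
        # (Raises ValueError if every weight is zero, i.e. on extinction.)
        chosen = rng.choices(range(N), weights=weights, k=N)
        parents.append(chosen)
        # Mutation: offspring inherit all parental mutations plus new ones.
        counts.append([prev[j] + poisson_sample(U, rng) for j in chosen])
    return parents, counts


def pair_statistics(parents, counts, a, b):
    """Trace two distinct individuals of the last generation backward in time.

    Returns (T, d), the time to the common ancestor and the distance,
    or None if the two lineages do not meet within the stored history.
    """
    g = len(parents)
    ka, kb = counts[g][a], counts[g][b]
    while g > 0:
        a, b = parents[g - 1][a], parents[g - 1][b]
        g -= 1
        if a == b:
            return len(parents) - g, ka + kb - 2 * counts[g][a]
    return None


def sample_means(N, U, s, alpha, generations=200, runs=20, seed=0):
    """Average T and d over all coalesced pairs and over independent runs."""
    rng = random.Random(seed)
    t_sum = d_sum = n_pairs = 0
    for _ in range(runs):
        parents, counts = simulate(N, U, s, alpha, generations, rng)
        for a in range(N):
            for b in range(a + 1, N):
                stat = pair_statistics(parents, counts, a, b)
                if stat is not None:
                    t_sum += stat[0]
                    d_sum += stat[1]
                    n_pairs += 1
    return t_sum / n_pairs, d_sum / n_pairs
```

A call such as `sample_means(N=30, U=0.5, s=0.5, alpha=0.5)` returns estimates
of the two summary statistics. Storing the whole parent history keeps the
backward tracing simple at the cost of memory proportional to the number of
generations times $N$; the number of generations must be large compared with
the typical coalescence time, otherwise discarding non-coalesced pairs biases
the averages.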



Here we present only the results for diminishing epistasis ($\alpha = 0.5$). A
more complete and detailed discussion will be presented elsewhere. In Fig. 1 we
show the dependence of $\bar{T}_s$ and $\bar{d}_s$ on the mutation rate $U$ for
$s = 0.5$ and several values of the population size $N$. On the one hand, for
small $U$ we find that both $\bar{T}_s / N$ and $\bar{d}_s$ are practically
independent of $N$. In fact, similarly to the strong-selection limit, in this
regime we find $\bar{T}_s / N \approx \exp(-U/s)$ and $\bar{d}_s \approx 2Us$.
On the other hand, for $U$ large we find a regime reminiscent of the neutral
limit, in which $\bar{T}_s / N$ becomes independent of the mutation rate. In
this case we can define an effective population size $N_e$, which depends on
$s$ and $N$ but not on $U$, such that $\bar{T}_s = N_e$ and
$\bar{d}_s \approx 2UN_e$. As expected, we find that $N_e$ decreases with
increasing $s$ since the number of breeding individuals decreases with $s$.
More specifically, $N_e$ seems to decrease like A^^"'' for $s$ not too close to
1. Interestingly, as illustrated in Fig. 1b, the different scaling of
$\bar{d}_s$ with $N$ in these two regimes leads to an abrupt increase of this
quantity at a finite value of $U$, which may signal the existence of a phase
transition in the thermodynamic limit $N \to \infty$. We note that these
features are not observed for $\alpha > 1$.